Abstract

Sequential decision problems are often ap- 
proximately solvable by simulating possible 
future action sequences. Metalevel decision 
procedures have been developed for select- 
ing which action sequences to simulate, based 
on estimating the expected improvement in 
decision quality that would result from any 
particular simulation; an example is the re- 
cent work on using bandit algorithms to con- 
trol Monte Carlo tree search in the game 
of Go. In this paper we develop a theo- 
retical basis for metalevel decisions in the 
statistical framework of Bayesian selection 
problems, arguing (as others have done) that 
this is more appropriate than the bandit 
framework. We derive a number of basic 
results applicable to Monte Carlo selection 
problems, including the first finite sampling 
bounds for optimal policies in certain cases; 
we also provide a simple counterexample to 
the intuitive conjecture that an optimal pol- 
icy will necessarily reach a decision in all 
cases. We then derive heuristic approxima- 
tions in both Bayesian and distribution-free 
settings and demonstrate their superiority to 
bandit-based heuristics in one-shot decision 
problems and in Go. 



1 Introduction 

The broad family of sequential decision problems in- 
cludes combinatorial search problems, game playing, 
robotic path planning, model-predictive control prob- 
lems, Markov decision processes (MDP), whether fully 
or partially observable, and a huge range of applica- 
tions. In almost all realistic instances, exact solution 
is intractable and approximate methods are sought. 
Perhaps the most popular approach is to simulate a
limited number of possible future action sequences, in
order to find a move in the current state that is (hope- 
fully) near-optimal. In this paper, we develop a the- 
oretical framework to examine the problem of select- 
ing which future sequences to simulate. We derive a 
number of new results concerning optimal policies for 
this selection problem as well as new heuristic policies 
for controlling Monte Carlo simulations. As described 
below, these policies outperform previously published 
methods for "flat" selection and game-playing in Go. 

The basic ideas behind our approach are best ex- 
plained in a familiar context such as game playing. 
A typical game-playing algorithm chooses a move by 
first exploring a tree or graph of move sequences and 
then selecting the most promising move based on this 
exploration. Classical algorithms typically explore in a 
fixed order, imposing a limit on exploration depth and 
using pruning methods to avoid irrelevant subtrees; 
they may also reuse some previous computations (see 
Section 6.2). Exploring unpromising or highly predictable
paths to great depth is often wasteful; for a
given amount of exploration, decision quality can be
improved by directing exploration towards those action
sequences whose outcomes are helpful in selecting
a good move. Thus, the metalevel decision problem is
to choose what future action sequences to explore (or,
more generally, what deliberative computations to do),
while the object-level decision problem is to choose an
action to execute in the real world.

That the metalevel decision problem can itself be formulated
and solved decision-theoretically was noted
by Matheson (1968), borrowing from the related concept
of information value theory (Howard 1966). In
essence, computations can be selected according to
the expected improvement in decision quality resulting
from their execution. I. J. Good (1968) independently
proposed using this idea to control search in chess, and
later defined "Type II rationality" to refer to agents
that optimally solve the metalevel decision problem
before acting. As interest in probabilistic and
decision-theoretic approaches in AI grew during the 1980s,
several authors explored these ideas further (Dean and
Boddy 1988; Doyle 1988; Fehling and Breese 1988;
Horvitz 1987). Work by Russell and Wefald (1988a,b,
1991a,b) formulated the metalevel sequential decision
problem, employing an explicit model of the results of
computational actions, and applied this to the control
of game-playing search in Othello with encouraging
results.

An independent thread of research on metalevel control
began with work by Kocsis and Szepesvari (2006)
on the UCT algorithm, which operates in the context
of Monte Carlo tree search (MCTS) algorithms. In
MCTS, each computation takes the form of a simulation
of a randomized sequence of actions leading from
a leaf of the current tree to a terminal state. UCT
is primarily a method for selecting a leaf from which
to conduct the next simulation, and forms the core of
the successful MoGo algorithm for Go playing (Gelly
and Silver 2011). The UCT algorithm is based on
the theory of bandit problems (Berry and Fristedt
1985) and the asymptotically near-optimal UCB1
bandit algorithm (Auer et al. 2002). UCT applies
UCB1 recursively to select actions to perform within
simulations.

It is natural to consider whether the two indepen- 
dent threads are consistent; for example, are bandit 
algorithms such as UCB1 approximate solutions to 
some particular case of the metalevel decision prob- 
lem defined by Russell and Wefald? The answer, per- 
haps surprisingly, is no. The essential difference is 
that, in bandit problems, every trial involves execut- 
ing a real object-level action with real costs, whereas 
in the metareasoning problem the trials are simulations
whose cost is usually independent of the utility
of the action being simulated. Hence, as Audibert
et al. (2010) and Bubeck et al. (2011) have also noted,
UCT applies bandit algorithms to problems that are
not bandit problems.

One consequence of the mismatch is that bandit poli- 
cies are inappropriately biased away from exploring ac- 
tions whose current utility estimates are low. Another 
consequence is the absence of any notion of "stopping" 
in bandit algorithms, which are designed for infinite se- 
quences of trials. A metalevel policy needs to decide 
when to stop deliberating and execute a real action. 

Analyzing the metalevel problem within an appropriate
theoretical framework ought to lead to more effective
algorithms than those obtained within the bandit
framework. For Monte Carlo computations, in which
samples are gathered to estimate the utilities of actions,
the metalevel decision problem is an instance
of the selection problem studied in statistics (Bechhofer
1954; Swisher et al. 2003). Despite some recent
work (Frazier and Powell 2010; Tolpin and Shimony
2012b), the theory of selection problems is less well
understood than that of bandit problems. Most work
has focused on the probability of selection error rather
than optimal policies in the Bayesian setting (Bubeck
et al. 2011). Accordingly, we present in Sections 2
and 3 a number of results concerning optimal policies
for the general case as well as specific finite bounds on
the number of samples collected by optimal policies
for Bernoulli arms with beta priors. We also provide a
simple counterexample to the intuitive conjecture that
an optimal policy should not spend more on deciding
than the decision is worth; in fact, it is possible for an
optimal policy to compute forever. We also show by
counterexample that optimal index policies (Gittins
1989) may not exist for selection problems.



Motivated by this theoretical analysis, we propose in
Sections 4 and 5 two families of heuristic approximations,
one for the Bayesian case and one for the
distribution-free setting. We show empirically that
these rules give better performance than UCB1 on
a wide range of standard (non-sequential) selection
problems. Section 6 shows similar results for the case
of guiding Monte Carlo tree search in the game of Go.

2 On optimal policies for selection 

In a selection problem the decision maker is faced with
a choice among alternative arms.¹ To make this choice,
they may gather evidence about the utility of each of
these alternatives, at some cost. The objective is to
maximize the net utility, i.e., the expected utility of
the final arm selected, less the cost of gathering the
evidence. In the classical case (Bechhofer 1954), evidence
might consist of physical samples from a product
batch; in a metalevel problem with Monte Carlo simulations,
the evidence consists of outcomes of sampling
computations:

Definition 1. A metalevel probability model is
a tuple $(U_1, \ldots, U_k, \mathcal{E})$ consisting of jointly distributed
random variables:

• Real random variables $U_1, \ldots, U_k$, where $U_i$ is the
utility of arm $i$, and

• A countable set $\mathcal{E}$ of random variables, each variable
$E \in \mathcal{E}$ being a computation that can be performed
and whose value is the result of that computation.

For simplicity, in what follows we assume the utilities
$U_i$ are bounded, without loss of generality in $[0, 1]$. We
will abuse notation and denote by $e \in E$ that $e$ is a
potential value of the computation $E$.

1 Alternative actions are known as arms in the bandit
setting; we borrow this terminology for uniformity.

Example 1 (Bernoulli sampling). In the Bernoulli
metalevel probability model, each arm will either
succeed or not, $U_i \in \{0, 1\}$, with an unknown latent
frequency of success $\Theta_i$, and a set of stochastic simulations
of possible consequences
$\mathcal{E} = \{E_{ij} \mid 1 \le i \le k,\ j \in \mathbb{N}\}$ that can be performed:

$$\Theta_i \sim \mathrm{Uniform}[0, 1] \quad \text{for } i \in \{1, \ldots, k\}$$
$$U_i \mid \Theta_i \sim \mathrm{Bernoulli}(\Theta_i) \quad \text{for } i \in \{1, \ldots, k\}$$
$$E_{ij} \mid \Theta_i \sim \mathrm{Bernoulli}(\Theta_i) \quad \text{for } i \in \{1, \ldots, k\},\ j \in \mathbb{N}$$

The one-armed Bernoulli metalevel probability
model has $k = 2$, $\Theta_1 = \lambda \in [0, 1]$ a constant, and
$\Theta_2 \sim \mathrm{Uniform}[0, 1]$.
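The generative process of Example 1 can be sketched in a few lines of Python. This is only an illustration: the function name `sample_bernoulli_model` is hypothetical, and the countably infinite family of simulations is truncated to `n_sims` draws per arm, which is an assumption made here for the sake of a finite example.

```python
import random

def sample_bernoulli_model(k, n_sims, rng):
    """Draw one world from the Bernoulli metalevel probability model.

    Only the first n_sims simulations per arm are drawn (a truncation
    assumed for this sketch; the model itself has one per j in N).
    """
    # Theta_i ~ Uniform[0, 1]: latent success frequency of arm i
    theta = [rng.random() for _ in range(k)]
    # U_i | Theta_i ~ Bernoulli(Theta_i): the real utility of arm i
    utility = [int(rng.random() < t) for t in theta]
    # E_ij | Theta_i ~ Bernoulli(Theta_i): simulated outcomes for arm i
    sims = [[int(rng.random() < t) for _ in range(n_sims)] for t in theta]
    return theta, utility, sims

theta, utility, sims = sample_bernoulli_model(k=3, n_sims=5, rng=random.Random(0))
```

Note that the utilities and the simulations are conditionally independent given the latent frequencies, so a simulation is informative about an arm only through its effect on the posterior over $\Theta_i$.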



A metalevel probability model, when combined with a
cost of computation $c > 0$,² defines a metalevel decision
problem: what is the optimal strategy with which
to choose a sequence of computations $E \in \mathcal{E}$ in order
to maximize the agent's net utility? Intuitively, this
strategy should choose the computations that give the
most evidence relevant to deciding which arm to use,
stopping when the cost of computation outweighs the
benefit gained. We formalize the selection problem as a
Markov Decision Process (see, e.g., Puterman (1994)):

Definition 2. A (countable state, undiscounted)
Markov Decision Process (MDP) is a tuple
$M = (S, s_0, A_s, T, R)$ where: $S$ is a countable set of states,
$s_0 \in S$ is the fixed initial state, $A_s$ is a countable set
of actions available in state $s \in S$, $T(s, a, s')$ is the
transition probability from $s \in S$ to $s' \in S$ after performing
action $a \in A_s$, and $R(s, a, s')$ is the expected
reward received on such a transition.

To formulate the metalevel decision problem as an
MDP, we define the states as sequences of computation
outcomes and allow for a terminal state when the
agent chooses to stop computing and act:

Definition 3. Given a metalevel probability model³
$(U_1, \ldots, U_k, \mathcal{E})$ and a cost of computation $c > 0$, a
corresponding metalevel decision problem is any
MDP $M = (S, s_0, A_s, T, R)$ such that

$$S = \{\bot\} \cup \{(e_1, \ldots, e_n) : e_i \in E_i \text{ for all } i, \text{ for finite } n \ge 0 \text{ and distinct } E_i \in \mathcal{E}\}$$
$$s_0 = ()$$
$$A_s = \{\bot\} \cup \mathcal{E}_s$$

where $\bot \in S$ is the unique terminal state, where $\mathcal{E}_s \subseteq \mathcal{E}$
is a state-dependent subset of allowed computations,
and when given any $s = (e_1, \ldots, e_n) \in S$, computational
action $E \in \mathcal{E}$, and $s' = (e_1, \ldots, e_n, e) \in S$ where
$e \in E$, we have:

$$T(s, E, s') = P(E = e \mid E_1 = e_1, \ldots, E_n = e_n)$$
$$T(s, \bot, \bot) = 1$$
$$R(s, E, s') = -c$$
$$R(s, \bot, \bot) = \max_i \mu_i(s)$$

where $\mu_i(s) = E[U_i \mid E_1 = e_1, \ldots, E_n = e_n]$.

2 The assumption of a fixed cost of computation is a
simplification; precise conditions for its validity are given
by Harada (1997).

3 Definition 1 made no assumption about the computational
result variables $E_i \in \mathcal{E}$, but for simplicity in the
following we'll assume that each $E_i$ takes one of a countable
set of values. Without loss of generality, we'll further
assume the domains of the computational variables $E \in \mathcal{E}$
are disjoint.

Note that when stopping in state $s$, the expected utility
of action $i$ is by definition $\mu_i(s)$, so the optimal action
to take is $i^* \in \arg\max_i \mu_i(s)$, which has expected
utility $\mu_{i^*}(s) = \max_i \mu_i(s)$.

One can optionally add an external constraint on the
number of computational actions, or their total cost,
in the form of a deadline or budget. This bridges with
the related area of budgeted learning (Madani et al.
2004). Although this feature is not formalized in the
MDP, it can be added by including either time or past
total cost as part of the state.

Example 2 (Bernoulli sampling). In the Bernoulli
metalevel probability model (Example 1), note that:

$$\Theta_i \mid E_{i1}, \ldots, E_{in_i} \sim \mathrm{Beta}(s_i + 1, f_i + 1) \qquad (1)$$
$$E_{i(n_i+1)} \mid E_{i1}, \ldots, E_{in_i} \sim \mathrm{Bernoulli}\!\left(\frac{s_i + 1}{n_i + 2}\right) \qquad (2)$$
$$E[U_i \mid E_{i1}, \ldots, E_{in_i}] = (s_i + 1)/(n_i + 2) \qquad (3)$$

by standard properties of these distributions, where
$s_i = \sum_{j=1}^{n_i} E_{ij}$ is the number of simulated successes of
arm $i$, and $f_i = n_i - s_i$ the failures. By Equation (1),
the state space is the set of all $k$ pairs $(s_i, f_i)$; Equations
(2) and (3) suffice to give the transition probabilities
and terminal rewards, respectively. The one-armed
Bernoulli case is similar, requiring as state just
$(s, f)$ defining the posterior over $\Theta_2$.
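The state representation of Example 2 translates directly into code. The following is a minimal sketch for a single arm, using exact rational arithmetic; the function names `posterior_mean` and `transitions` are illustrative and not taken from the paper.

```python
from fractions import Fraction

def posterior_mean(s, f):
    """E[U_i | s successes, f failures] under the uniform prior (Equation 3)."""
    return Fraction(s + 1, s + f + 2)

def transitions(s, f):
    """Successor states of (s, f) and their probabilities (Equation 2)."""
    p = posterior_mean(s, f)  # predictive probability that the next simulation succeeds
    return {(s + 1, f): p, (s, f + 1): 1 - p}
```

For instance, after three simulated successes and one failure the posterior mean is $(3+1)/(4+2) = 2/3$, which is also the probability of moving to state $(4, 1)$.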

Given a metalevel decision problem
$M = (S, s_0, A_s, T, R)$ one defines policies and value
functions as in any MDP. A (deterministic, stationary)
metalevel policy $\pi$ is a function mapping states
$s \in S$ to actions to take in that state, $\pi(s) \in A_s$.



The value function for a policy $\pi$ gives the expected
total reward received under that policy starting from a
given state $s \in S$, and the Q-function does the same
when starting in a state $s \in S$ and taking a given
action $a \in A_s$:

$$V^\pi_M(s) = E^\pi_M\!\left[\sum_{i=0}^{N} R(S_i, \pi(S_i), S_{i+1}) \,\middle|\, S_0 = s\right] \qquad (4)$$

where $N \in [0, \infty]$ is the random time the MDP is
terminated, i.e., the unique time where $\pi(S_N) = \bot$,
and similarly for the Q-function $Q^\pi_M(s, a)$.

As usual, an optimal policy $\pi^*$, when it exists, is one
that maximizes the value from every state $s \in S$, i.e.,
if we define for each $s \in S$

$$V^*_M(s) = \sup_\pi V^\pi_M(s),$$

then an optimal policy $\pi^*$ satisfies $V^{\pi^*}_M(s) = V^*_M(s)$ for
all $s \in S$, where we break ties in favor of stopping.

The optimal policy must balance the cost of computa- 
tions with the improved decision quality that results. 
This tradeoff is made clear in the value function: 

Theorem 4. The value function of a metalevel decision
process $M = (S, s_0, A_s, T, R)$ is of the form

$$V^\pi_M(s) = E^\pi_M\!\left[-c\,N + \max_i \mu_i(S_N) \,\middle|\, S_0 = s\right]$$

where $N$ denotes the (random) total number of computations
performed; similarly for $Q^\pi_M(s, a)$.

In many problems, including the Bernoulli sampling
model of Example 2, the state space is infinite. Does
this preclude solving for the optimal policy? Can
infinitely many computations be performed?

There is in full generality an upper bound on the ex- 
pected number of computations a policy performs: 

Theorem 5. The optimal policy's expected number of
computations is bounded by the value of perfect information
(Howard 1966) times the inverse cost $1/c$:

$$E^{\pi^*}[N \mid S_0 = s] \le \frac{1}{c}\left(E\!\left[\max_i U_i \,\middle|\, S_0 = s\right] - \max_i \mu_i(s)\right)$$

Further, any policy $\pi$ with infinite expected number of
computations has negative infinite value, hence the optimal
policy stops with probability one.

Although the expected number of computations is always
bounded, there are important cases in which the
actual number is not, such as the following inspired by
the sequential probability ratio test (Wald 1945):

Example 3. Consider the Bernoulli sampling model
for two arms but with a different prior: $\Theta_1 = 1/2$,
and $\Theta_2$ is $1/3$ or $2/3$ with equal probability. Simulating
arm 1 gains nothing, and after $(s, f)$ simulated
successes and failures of arm 2 the posterior odds ratio
is

$$\frac{P(\Theta_2 = 2/3 \mid s, f)}{P(\Theta_2 = 1/3 \mid s, f)} = \frac{(2/3)^s (1/3)^f}{(1/3)^s (2/3)^f} = 2^{s-f}$$

Note that this ratio completely specifies the posterior
distribution of $\Theta_2$, and hence the distribution of the
utilities and all future computations. Thus, whether it
is optimal to continue is a function only of this ratio,
and thus of $s - f$. For sufficiently low cost, the optimal
policy samples when $s - f$ equals $-1$, $0$, or $1$. But
with probability $4/9$ (one success and one failure, in
either order, each order having probability $(1/3)(2/3)$),
a state with $s - f = 0$ transitions
to another state with $s - f = 0$ after two samples, giving
finite, although exponentially decreasing, probability to
arbitrarily long sequences of computations.
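The quantities in Example 3 can be checked with exact rational arithmetic. The sketch below is illustrative only (the helper names are hypothetical); it computes the posterior odds for arm 2 and the probability that two further samples taken from a state with $s - f = 0$ lead back to $s - f = 0$.

```python
from fractions import Fraction

HI, LO = Fraction(2, 3), Fraction(1, 3)  # the two equally likely values of Theta_2

def posterior_hi(s, f):
    """P(Theta_2 = 2/3 | s successes, f failures) under the two-point prior."""
    w_hi = HI**s * (1 - HI)**f
    w_lo = LO**s * (1 - LO)**f
    return w_hi / (w_hi + w_lo)

def odds(s, f):
    """Posterior odds ratio of Theta_2 = 2/3 against Theta_2 = 1/3."""
    p = posterior_hi(s, f)
    return p / (1 - p)

def p_success(s, f):
    """Predictive probability that the next simulation of arm 2 succeeds."""
    p = posterior_hi(s, f)
    return p * HI + (1 - p) * LO

def p_return_to_zero():
    """Probability that two samples from s - f = 0 end at s - f = 0 again."""
    up_then_down = p_success(0, 0) * (1 - p_success(1, 0))
    down_then_up = (1 - p_success(0, 0)) * p_success(0, 1)
    return up_then_down + down_then_up
```

Because the odds depend on $(s, f)$ only through $s - f$, a policy that keeps sampling while $|s - f| \le 1$ can revisit the same belief state arbitrarily often, which is what makes the number of computations unbounded.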

However, in a number of settings, including the original
Bernoulli model of Example 1, we can prove an
upper bound on the number of computations. For
reasons of space, and for its later use in Section 4,
we prove here the bound for the one-armed Bernoulli
model.

Before we can do this, we need to get an analytical 
handle on the optimal policy. The key is through a 
natural approximate policy: 

Definition 6. Given a metalevel decision problem
$M = (S, s_0, A_s, T, R)$, the myopic policy $\pi^m(s)$ is defined
to equal $\arg\max_{a \in A_s} Q^m(s, a)$ where
$Q^m(s, \bot) = \max_i \mu_i(s)$ and

$$Q^m(s, E) = E_M\!\left[-c + \max_i \mu_i(S_1) \,\middle|\, S_0 = s, A_0 = E\right].$$
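For the one-armed Bernoulli model, the myopic Q-values of Definition 6 have a simple closed form in terms of the state $(s, f)$ of arm 2. The following sketch (function names are hypothetical) computes the expected one-step improvement in the stopping value, before the cost $c$ is subtracted, and the resulting myopic decision.

```python
def myopic_gain(lam, s, f):
    """Q^m(state, E) + c - Q^m(state, stop) for sampling arm 2 once in state (s, f).

    lam is the known utility of the constant arm.
    """
    n = s + f
    mu = (s + 1) / (n + 2)       # current posterior mean of arm 2
    mu_up = (s + 2) / (n + 3)    # posterior mean after one more success
    mu_down = (s + 1) / (n + 3)  # posterior mean after one more failure
    after = mu * max(lam, mu_up) + (1 - mu) * max(lam, mu_down)
    return after - max(lam, mu)

def myopic_samples(lam, s, f, c):
    """True if the myopic policy computes in state (s, f); ties favour stopping."""
    return myopic_gain(lam, s, f) > c
```

With `lam = 0.5` and no samples taken, the gain is $1/12$, so the myopic policy samples exactly when $c < 1/12$. The gain is zero whenever a single sample cannot move the posterior mean across `lam`, which is one way to see why the myopic policy tends to stop early.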

The myopic policy (known as the metalevel greedy approximation
with single-step assumption in Russell
and Wefald (1991a)) takes the best action, to either
stop or perform a computation, under the assumption
that at most one further computation can be performed.
It has a tendency to stop too early, because
changing one's mind about which real action to take
often takes more than one computation. In fact, we
have:

Theorem 7. Given a metalevel decision problem
$M = (S, s_0, A_s, T, R)$, if the myopic policy performs some
computation in state $s \in S$, then the optimal policy
does too, i.e., if $\pi^m(s) \ne \bot$ then $\pi^*(s) \ne \bot$.

Despite this property, the stopping behavior of the my- 
opic policy does have a close connection to that of the 
optimal policy: 

Definition 8. Given a metalevel decision problem
$M = (S, s_0, A_s, T, R)$, a subset $S' \subseteq S$ of states
is closed under transitions if whenever $s' \in S'$,
$a \in A_{s'}$, $s'' \in S$, and $T(s', a, s'') > 0$, we have $s'' \in S'$.

Theorem 9. Given a metalevel decision problem
$M = (S, s_0, A_s, T, R)$ and a subset $S' \subseteq S$ of states closed
under transitions, if the myopic policy stops in all
states $s' \in S'$ then the optimal policy does too.



Using these results connecting the behavior of the optimal
and myopic policies, we can prove our bound:

Theorem 10. The one-armed Bernoulli decision process
with constant arm $\lambda \in [0, 1]$ performs at most
$\lambda(1 - \lambda)/c - 3 \le 1/(4c) - 3$ computations.

Proof. Using Definition 6 and Example 2, we determine
which states the myopic policy stops in by bounding
$Q^m(s, E)$. For a state $(s, f)$, let $\mu = (s + 1)/(n + 2)$
be the mean utility for arm 2, where $n = s + f$. Fixing
$n$ and maximizing over $\mu$, we get a sufficient condition
for stopping:

$$c \ge \frac{\lambda(1 - \lambda)}{n + 3} \iff n \ge \frac{\lambda(1 - \lambda)}{c} - 3 \qquad (5)$$

Since the set of states satisfying Equation (5) is closed
under transitions ($n$ only increases), by Theorem 9
the optimal policy stops in these states as well.
Finally, note $\max_{\lambda \in [0, 1]} \lambda(1 - \lambda) = 1/4$. □

A key implication is that the optimal policy can be
computed in time $O(1/c^2)$, i.e., quadratic in the inverse
cost. This is particularly appropriate when the
cost of computation is relatively high, such as in simulation
experiments (Swisher et al. 2003), or when the
decision to be made is critical.

3 Context effects and non-indexability

The Gittins index theorem (Gittins 1979) is a famous
structural result for bandit problems. It states that in
bandit problems with independent reward distribution
for each arm and geometric discounting, the optimal
policy is an index policy: each arm is assigned a
real-valued index based on its state only, such that it
is optimal to sample the arm with greatest index.

The analogous result does not hold for metalevel decision
problems, even when the actions' values are independent
(this is formalized later in Definition 13):

Example 4 (Non-indexability). Consider a metalevel
probability model with three actions. $U_1$ is equally
likely to be $-1.5$ or $1.5$ (low mean, high variance), $U_2$
is equally likely to be $0.25$ or $1.75$ (high mean, low variance),
and $U_3 = \lambda$ has a known value (the context).
The two computations are to observe exactly $U_1$ and
$U_2$, respectively, each with cost $0.2$. The corresponding
metalevel MDP has 9 states and can be solved exactly.
Figure 1 plots $Q^*_\lambda(s_0, U_i) - Q^*_\lambda(s_0, \bot)$ as a function of
the known value $\lambda$. As the context $\lambda$ varies the optimal
action inverts from observing 1 to observing 2.
Inversions like this are impossible for index policies.

Figure 1: Optimal Q-values of computing relative to
stopping as a function of the utility of the fixed alternative.
Note the inversion where for low $\lambda$ observing
action 1 is strictly optimal, while for medium $\lambda$ observing
action 2 is strictly optimal.

There is, however, a restriction on what kind of influence
the context can have:

Definition 11. Given a metalevel decision problem
$M = (S, s_0, A_s, T, R)$, and a constant $\lambda \in \mathbb{R}$, define
$M_\lambda = (S, s_0, A_s, T, R_\lambda)$ to be $M$ with an additional
action of known value $\lambda$, defined by:

$$R_\lambda(s, E, s') = R(s, E, s')$$
$$R_\lambda(s, \bot, \bot) = \max\{\lambda, R(s, \bot, \bot)\}$$

Note this is equivalent to adding an extra arm with
constant value $U_{k+1} = \lambda$.

Theorem 12. Given a metalevel decision problem
$M = (S, s_0, A_s, T, R)$, there exists a real interval $I(s)$
for every state $s \in S$ such that it is optimal to stop in
state $s$ in $M_\lambda$ iff $\lambda \notin I(s)$. Furthermore, $I(s)$ contains
$\max_i \mu_i(s)$ whenever it is nonempty.
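The inversion described in Example 4 can be reproduced by solving the small metalevel MDP exactly. The sketch below is one possible way to do so, by plain recursion over the set of utilities observed so far; the names `solve`, `OUTCOMES` and `COST` are illustrative.

```python
OUTCOMES = {1: (-1.5, 1.5), 2: (0.25, 1.75)}  # equally likely values of U_1 and U_2
COST = 0.2                                    # cost of observing either utility

def solve(lam):
    """Return Q*(s0, observe U_i) - Q*(s0, stop) for i = 1, 2, given context lam."""
    def mean(i, seen):
        # Posterior mean of U_i: the observed value, or else the prior mean.
        return seen[i] if i in seen else sum(OUTCOMES[i]) / 2

    def stop_value(seen):
        return max(lam, mean(1, seen), mean(2, seen))

    def q_observe(i, seen):
        successors = [value({**seen, i: u}) for u in OUTCOMES[i]]
        return -COST + sum(successors) / 2

    def value(seen):
        options = [stop_value(seen)]
        options += [q_observe(i, seen) for i in OUTCOMES if i not in seen]
        return max(options)

    return {i: q_observe(i, {}) - stop_value({}) for i in OUTCOMES}
```

Evaluating `solve(0.0)` and `solve(1.2)` shows the preferred computation switching from observing $U_1$ to observing $U_2$ as the context grows, even though nothing about the two uncertain arms themselves has changed.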

4 The blinkered policy 

The myopic policy is an extreme approximation, often 
stopping far too early. A better approximation can be 
obtained, at least for the case where each computation 
can only affect the value of one action. The technical 
definition (closely related to subtree independence in 
Russell and Wefald's work) is as follows: 

Definition 13. A metalevel probability model
$(U_1, \ldots, U_k, \mathcal{E})$ has independent actions if the computational
variables can be partitioned $\mathcal{E} = \mathcal{E}_1 \cup \cdots \cup \mathcal{E}_k$
such that the sets $\{U_i\} \cup \mathcal{E}_i$ are independent
of each other for different $i$.

With independent actions, we can talk about metalevel 
policies that focus on computations affecting a sin- 
gle action. These policies are not myopic — they can 
consider arbitrarily many computations — but they are 
blinkered because they can look in only a single direc- 
tion at a time: 

Definition 14. Given a metalevel decision problem
$M = (S, s_0, A_s, T, R)$ with independent actions,
the blinkered policy $\pi^b$ is defined by
$\pi^b(s) = \arg\max_{a \in A_s} Q^b(s, a)$ where
$Q^b(s, \bot) = \max_i \mu_i(s)$ and for $E_i \in \mathcal{E}_i$

$$Q^b(s, E_i) = \sup_{\pi \in \Pi^b_i} Q^\pi(s, E_i) \qquad (6)$$

where $\Pi^b_i$ is the set of policies $\pi$ where
$\pi(s) \in \mathcal{E}_i \cup \{\bot\}$ for all $s \in S$.

Clearly, blinkered policies are better than myopic:
$Q^m(s, a) \le Q^b(s, a) \le Q^*(s, a)$. Moreover, the blinkered
policy can be computed in time proportional to
the number of arms, by breaking the decision problem
into separate subproblems:

Definition 15. Given a metalevel decision problem
$M = (S, s_0, A_s, T, R)$ with independent actions, a
one-action metalevel decision problem for
$i = 1, \ldots, k$ is the metalevel decision problem
$M^1_{i,\lambda} = (S_i, s_0, A_{s_i}, T_i, R_i)$ defined by the metalevel probability
model $(U_0, U_i, \mathcal{E}_i)$ with $U_0 = \lambda$.

Note that given a state $s$ of a metalevel decision problem,
we can form a state $s_i$ by taking only the results
of computations in $\mathcal{E}_i$ (see Definition 3). By action
independence, $\mu_i(s)$ is a function only of $s_i$.

Theorem 16. Given a metalevel decision problem
$M = (S, s_0, A_s, T, R)$ with independent actions, let
$M^1_{i,\lambda}$ be the $i$th one-action metalevel decision problem
for $i = 1, \ldots, k$. Then for any $s \in S$, whenever
$E_i \in A_s \cap \mathcal{E}_i$, we have:

$$Q^b_M(s, E_i) = Q^*_{M^1_{i,\mu^*_{-i}}}(s_i, E_i)$$

where $\mu^*_{-i} = \max_{j \ne i} \mu_j(s)$.



Theorem 16 shows that to compute the blinkered policy
we need only compute the optimal policies for $k$
separate one-action problems.

For the Bernoulli problem with $k$ actions, the one-action
metalevel decision problems are all one-action
Bernoulli problems (Example 1). By Theorem 10 these
policies perform at most $1/(4c) - 3$ computations. As
a result, the blinkered policy can be numerically computed
in time $O(D/c^2)$ independent of $k$ by backwards
induction, where $D$ is the number of points $\lambda \in [0, 1]$
for which we compute $Q^*_{M^1_{i,\lambda}}(s)$.⁴ This will be worth
the cost in the same situations as mentioned at the
end of Section 2.
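The backwards induction mentioned above can be sketched for a single value of the constant arm. The code below is an illustration under stated assumptions, not the implementation used for the experiments: it takes the horizon from the bound of Theorem 10, treats states at that horizon as stopping states, and works back to the empty state. The function name `one_armed_values` is hypothetical.

```python
import math

def one_armed_values(lam, c):
    """Optimal values V*(s, f) of the one-armed Bernoulli problem.

    lam is the known utility of the constant arm and c the cost per sample.
    """
    # Theorem 10: the optimal policy stops once n = s + f >= lam(1 - lam)/c - 3.
    horizon = max(0, math.ceil(lam * (1 - lam) / c - 3))
    value = {}
    for n in range(horizon, -1, -1):
        for s in range(n + 1):
            f = n - s
            mu = (s + 1) / (n + 2)  # posterior mean of the uncertain arm
            stop = max(lam, mu)
            if n == horizon:
                value[s, f] = stop
            else:
                sample = -c + mu * value[s + 1, f] + (1 - mu) * value[s, f + 1]
                value[s, f] = max(stop, sample)
    return value
```

The table has on the order of $1/c^2$ entries, each filled in constant time, which matches the quadratic dependence on the inverse cost. Repeating the computation on a grid of $D$ values of `lam` and interpolating between them would give the $O(D/c^2)$ procedure described in the text.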

Figure 2 compares the blinkered policy to several other
policies from the literature, using a Bernoulli sampling
problem with $k = 25$ and a wide range of values for
the step cost $c$. Performance is measured by expected
regret, where the regret includes the cost of sampling:
$R = (\max_i U_i) - U_j + cn$, where $n$ is the number of computations
and $j$ is the action actually selected. The
blinkered policy significantly outperforms all others.
The myopic policy plateaus as it quickly reaches a position
where no single computation can change the final
action choice. ESPb performs quite well given that it is
making a normal approximation to the Beta posterior.
The curves for UCB1-B and UCB1-b show that even
given a good stopping rule, UCB1's choice of actions
to sample is not ideal.

Figure 2: Average regret of various policies as a function
of the cost in a 25-action Bernoulli sampling problem,
over 1000 trials. Error bars omitted as they are
negligible (the relative error is at most 0.03).



5 Upper bounds on Value of 
Information 

In many practical applications of the selection problem,
such as search in the game of Go, prior distributions
are unavailable.⁵ In such cases, one can still
bound the value of information of myopic policies using
concentration inequalities to derive
distribution-independent bounds on the VOI. We obtain such
bounds under the following assumptions:

4 In our experiments below, $D = 129$ points are equally
spaced, using linear interpolation between points.

5 The analysis is also applicable to some Bayesian settings,
using "fake" samples to simulate prior distributions.



1. Samples are iid given the value of the arms 
(variables), as in the Bayesian schemes such as 
Bernoulli sampling. 

2. The expectation of a selection in a belief state is 
equal to the sample mean (and therefore, after 
sampling terminates, the arm with the greatest 
sample mean will be selected). 

When considering possible samples in the blinkered
semi-myopic setting, two cases are possible: either the
arm $\alpha$ with the highest sample mean $\bar{X}_\alpha$ is tested, and
$\bar{X}_\alpha$ becomes lower than $\bar{X}_\beta$ of the second-best arm $\beta$;
or, another arm $i$ is tested, and $\bar{X}_i$ becomes higher
than $\bar{X}_\alpha$.

Our bounds below are applicable to any bounded distribution
(without loss of generality bounded in $[0, 1]$).
Similar bounds can be derived for certain unbounded
distributions, such as the normally distributed prior
value with normally distributed sampling. We derive
a VOI bound for testing an arm a fixed $N$ times, where
$N$ can be the remaining budget of available samples or
any other integer quantity. Denote by $\Lambda^b_i$ the intrinsic
VOI of testing the $i$th arm $N$ times, and the number
of samples already taken from the $i$th arm by $n_i$.

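Theorem 17. $\Lambda^b_i$ is bounded from above as

$$\Lambda^b_\alpha \le \frac{N \bar{X}^{n_\beta}_\beta}{n_\alpha + N} \Pr\!\left(\bar{X}^{n_\alpha + N}_\alpha \le \bar{X}^{n_\beta}_\beta\right)$$
$$\Lambda^b_{i \mid i \ne \alpha} \le \frac{N (1 - \bar{X}^{n_\alpha}_\alpha)}{n_i + N} \Pr\!\left(\bar{X}^{n_i + N}_i \ge \bar{X}^{n_\alpha}_\alpha\right) \qquad (7)$$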



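The probabilities can be bounded from above using
the Hoeffding inequality (Hoeffding 1963):

Theorem 18. The probabilities in Equation (7) are
bounded from above as

$$\Pr\!\left(\bar{X}^{n_\alpha + N}_\alpha \le \bar{X}^{n_\beta}_\beta\right) \le 2 \exp\!\left(-\varphi (\bar{X}^{n_\alpha}_\alpha - \bar{X}^{n_\beta}_\beta)^2 n_\alpha\right)$$
$$\Pr\!\left(\bar{X}^{n_i + N}_i \ge \bar{X}^{n_\alpha}_\alpha\right) \le 2 \exp\!\left(-\varphi (\bar{X}^{n_\alpha}_\alpha - \bar{X}^{n_i}_i)^2 n_i\right) \qquad (8)$$

where $\varphi = \min 2\left(\frac{1 + n/N}{1 + \sqrt{n/N}}\right)^2 = 8(\sqrt{2} - 1)^2 > 1.37$.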

Corollary 19. An upper bound on the VOI estimate
$\Lambda^b_i$ is obtained by substituting Equation (8) into
Equation (7):

$$\Lambda^b_\alpha \le \hat{\Lambda}^b_\alpha = \frac{2 N \bar{X}^{n_\beta}_\beta}{n_\alpha + N} \exp\!\left(-\varphi (\bar{X}^{n_\alpha}_\alpha - \bar{X}^{n_\beta}_\beta)^2 n_\alpha\right)$$
$$\Lambda^b_{i \mid i \ne \alpha} \le \hat{\Lambda}^b_i = \frac{2 N (1 - \bar{X}^{n_\alpha}_\alpha)}{n_i + N} \exp\!\left(-\varphi (\bar{X}^{n_\alpha}_\alpha - \bar{X}^{n_i}_i)^2 n_i\right) \qquad (9)$$

More refined bounds can be obtained through tighter
estimates on the probabilities in Equation (7), for
example, based on the empirical Bernstein inequality
(Maurer and Pontil 2009), or through a more careful
application of the Hoeffding inequality, resulting in

$$\Lambda^b_i \le \ldots \left[\operatorname{erf}\!\big((1 - \bar{X}_{\ldots}) \sqrt{\ldots}\big) - \operatorname{erf}\!\big(\ldots\big)\right] \qquad (10)$$

Selection problems usually separate out the decision
of whether to sample or to stop (called the stopping
policy), and what to sample. We'll examine the first
issue here, along with the empirical evaluation of the
above approximate algorithms, and the second in the
following section.

Assuming that the sample costs are constant, a semi-myopic
policy will decide to test the arm that has the
best current VOI estimate. When the distributions
are unknown, it makes sense to use the upper bounds
established in Theorem 17, as we do in the following.
This evaluation assumes a fixed budget of samples,
which is completely used up by each of the candidate
schemes, making a stopping criterion irrelevant.

Figure 3: Average regret of various policies as a function
of the fixed number of samples in a 25-action
Bernoulli sampling problem, over 10000 trials.

The sampling policies are compared on random
Bernoulli selection problem instances. Figure 3 shows
results for randomly-generated selection problems with
25 Bernoulli arms, where the mean rewards of the arms
are distributed uniformly in $[0, 1]$, for a range of sample
budgets from 200 to 2000, with multiplicative step of 2,
averaging over 10000 trials. We compare UCB1 with the
policies based on the bounds in Equation (9) (VOI)
and Equation (10) (VOI+). UCB1 is always considerably
worse than the VOI-aware sampling policies.

6 Sampling in trees



The previous section addressed the selection problem
in the flat case. Selection in trees is more complicated.
The goal of Monte-Carlo tree search (Chaslot
et al. 2008) at the root node is usually to select an
action that appears to be the best based on outcomes
of search rollouts. But the goal of rollouts at non-root
nodes is different than at the root: here it is important
to better approximate the value of the node, so
that selection at the root can be more informed. The
exact analysis of sampling at internal nodes is outside
the scope of this paper. At present we have no better
proposal for internal nodes than to use UCT there.

We thus propose the following hybrid sampling scheme
(Tolpin and Shimony 2012a): at the root node, sample
based on the VOI estimate; at non-root nodes, sample
using UCT.
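The root-level choice in this hybrid scheme can be sketched as follows, assuming the Hoeffding-based VOI upper bound (the estimate labelled (9) in Section 5) is used as the VOI estimate. The function names are illustrative, and the sketch assumes at least two arms, each sampled at least once, with sample means in $[0, 1]$.

```python
import math

PHI = 8 * (math.sqrt(2) - 1) ** 2  # constant from the Hoeffding-based bound, about 1.37

def voi_estimates(means, counts, budget):
    """Upper-bound VOI estimates for sampling each arm `budget` more times.

    means[i] is the sample mean of arm i and counts[i] its sample count.
    """
    order = sorted(range(len(means)), key=lambda i: means[i], reverse=True)
    alpha, beta = order[0], order[1]  # current best and second-best arms
    estimates = []
    for i, (x, n) in enumerate(zip(means, counts)):
        if i == alpha:
            # Sampling the best arm helps only if it drops below the runner-up.
            scale, gap = means[beta], means[alpha] - means[beta]
        else:
            # Sampling another arm helps only if it overtakes the best arm.
            scale, gap = 1 - means[alpha], means[alpha] - x
        bound = 2 * budget * scale / (n + budget) * math.exp(-PHI * gap ** 2 * n)
        estimates.append(bound)
    return estimates

def select_root_arm(means, counts, budget):
    """Index of the root arm with the largest VOI estimate."""
    estimates = voi_estimates(means, counts, budget)
    return max(range(len(estimates)), key=estimates.__getitem__)
```

Unlike UCB1, the estimate for an arm whose sample mean is far below the leader decays exponentially in the squared gap, so clearly inferior arms stop receiving samples once they are unlikely to change the final choice.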

Strictly speaking, even at the root node the stationarity
assumptions⁶ underlying our belief-state MDP
for selection do not hold exactly. UCT is an adaptive
scheme, and therefore the values generated by
sampling at non-root nodes will typically cause values
observed at children of the root node to be
non-stationary. Nevertheless, sampling based on VOI estimates
computed as for stationary distributions works
well in practice. As illustrated by the empirical evaluation
(Section 6), estimates based on upper bounds
on the VOI result in good sampling policies, which exhibit
performance comparable to the performance of
some state-of-the-art heuristic algorithms.

6.1 Stopping criterion 

When a sample has a known cost commensurable with
the value of information of a measurement, an upper
bound on the intrinsic VOI can also be used to stop
the sampling if the intrinsic VOI of any arm is less
than the total cost of sampling $C$: $\max_i \Lambda_i \le C$.

The VOI estimates of Equations Q and ^ include 
the remaining sample budget N as a factor, but given 
the cost of a single sample c, the cost of the remaining 
samples accounted for in estimating the intrinsic VOI 
is C — cN. N can be dropped on both sides of the 
inequality, giving a reasonable stopping criterion: 



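$$\frac{1}{N}\Lambda^b_\alpha \le \frac{\overline{X}_\beta^{n_\beta}}{n_\alpha}\Pr\left(\overline{X}_\alpha^{n_\alpha+N} \le \overline{X}_\beta^{n_\beta}\right) \le c$$

$$\frac{1}{N}\max_i \Lambda^b_i \le \max_i \frac{1-\overline{X}_\alpha^{n_\alpha}}{n_i}\Pr\left(\overline{X}_i^{n_i+N} \ge \overline{X}_\alpha^{n_\alpha}\right) \le c \quad \forall i : i \ne \alpha \qquad (11)$$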
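
The stopping test itself reduces to one comparison once the per-arm upper bounds on the intrinsic VOI are available. The sketch below assumes those bounds have already been computed for the remaining budget N (how they are obtained is not shown); the function and parameter names are illustrative only.

```python
from typing import Sequence


def should_stop(voi_upper_bounds: Sequence[float],
                remaining_budget: int,
                sample_cost: float) -> bool:
    """Return True when sampling should stop.

    voi_upper_bounds: assumed upper bounds on the intrinsic VOI of each arm,
        estimated for `remaining_budget` further samples.
    remaining_budget: N, the number of samples still available.
    sample_cost: c, the cost of a single sample.

    Stops when max_i(bound_i) / N < c, i.e. when the largest per-sample
    VOI bound falls below the per-sample cost.
    """
    if remaining_budget <= 0:
        return True  # nothing left to spend
    return max(voi_upper_bounds) / remaining_budget < sample_cost
```

For example, with bounds of 0.002 and 0.005, N = 1000 and c = 1e-5, the largest per-sample bound is 5e-6, so sampling stops.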



The empirical evaluation (Section [6]) confirms the
viability of this stopping criterion and illustrates the
influence of the sample cost c on the performance of the
sampling policy. When the sample cost c is unknown,
one can perform initial calibration experiments to
determine a reasonable value, as done in the following.

* This is not a restriction, however, of the general
formalism in Section [2].

6.2 Sample redistribution in trees 

The above hybrid approach assumes that the information
obtained from rollouts in the current state is discarded
after a real-world action is selected. In practice, many
successful Monte-Carlo tree search algorithms reuse
rollouts generated at earlier search states, if the sample
traverses the current search state during the rollout;
thus, the value of information of a rollout is determined
not just by its influence on the choice of the action at
the current state, but also by its potential influence on
the choice at future search states.

One way to account for this reuse would be to incor- 
porate the 'future' value of information into a VOI 
estimate. However, this approach requires a nontriv- 
ial extension of the theory of metareasoning for search. 
Alternately, one can behave myopically with respect to 
the search tree depth: 

1. Estimate VOI as though the information is dis- 
carded after each step, 

2. Stop early if the VOI is below a certain threshold
(see Section 6.1), and

3. Save the unused sample budget for search in future
states, such that if the nominal budget is N, and the
unused budget in the last state is N_u, the search
budget in the next state will be N + N_u.
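
The three steps above can be sketched as a budget-accounting loop. This is an illustration under stated assumptions, not the authors' code: `voi_per_sample` is a hypothetical callable returning the current per-sample VOI bound after a given number of samples, and the rollout itself is omitted.

```python
from typing import Callable, Tuple


def search_with_carry_over(nominal_budget: int,
                           carried_over: int,
                           voi_per_sample: Callable[[int], float],
                           sample_cost: float) -> Tuple[int, int]:
    """Run one search state and return (samples_used, unused_budget).

    The available budget is N + N_u: the nominal budget plus whatever the
    previous state left unused. Sampling stops early once the per-sample
    VOI bound drops below the sample cost.
    """
    budget = nominal_budget + carried_over
    used = 0
    while used < budget:
        if voi_per_sample(used) < sample_cost:
            break  # myopic early stop (step 2)
        used += 1  # a real engine would perform one rollout here
    return used, budget - used  # the remainder is saved for the next state (step 3)
```

A driver would feed the second return value back in as `carried_over` for the next move, so that savings in easy positions are spent in harder ones.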

In this approach, the cost c of a sample in the current
state is the VOI of increasing the budget of a future
state by one sample. It is unclear whether this cost can
be accurately estimated, but assuming a fixed value
for a given problem type and algorithm implementation
should work. Indeed, the empirical evaluation
(Section 6.3) confirms that stopping and sample
redistribution based on a learned fixed cost substantially
improve the performance of the VOI-based sampling
policy in game tree search.

6.3 Playing Go against UCT 

The hybrid policies were compared on the game Go, a
search domain in which UCT-based MCTS has been
particularly successful (Gelly and Wang 2006). A
modified version of Pachi (Braudis and Loup Gailly
2011), a state of the art Go program, was used for the
experiments:

• The UCT engine of Pachi was extended with VOI- 
aware sampling policies at the first step. 



• The stopping criterion for the VOI-aware policy 
was modified and based solely on the sample cost, 
specified as a constant parameter. The heuristic 
stopping criterion for the original UCT policy was 
left unchanged. 

• The time-allocation model based on the fixed 
number of samples was modified for both the orig- 
inal UCT policy and the VOI-aware policies such 
that 

— Initially, the same number of samples is avail- 
able to the agent at each step, independently 
of the number of pre-simulated games; 

— If samples were unused at the current step, 
they become available at the next step. 

While the UCT engine is not the most powerful engine
of Pachi, it is still a strong player. On the other
hand, additional features of more advanced engines
would obscure the MCTS phenomena which are the
subject of the experiment. The engines were compared
on the 9x9 board, for 5000, 7000, 10000, and 15000
samples (game simulations) per ply, each experiment
repeated 1000 times. Figure [4] depicts a calibration
experiment, showing the winning rate of the VOI-aware
policy against UCT as a function of the stopping
threshold c (if the maximum VOI of a sample is below
the threshold, the simulation is stopped, and a move is
chosen). Each curve in the figure corresponds to a
certain number of samples per ply. For the stopping
threshold of $10^{-6}$, the VOI-aware policy is almost
always better than UCT, and reaches the winning rate of
64% for 10000 samples per ply.

Figure 4: Winning rate of the VOI-aware policy in
Go as a function of the cost c, for varying numbers of
samples per ply.

Figure [5] shows the winning rate of VOI against UCT
for $c = 10^{-6}$. In agreement with the intuition
(Figure [?]), VOI-based stopping and sample
redistribution are most influential for intermediate
numbers of samples per ply. When the maximum number of
samples is too low, early stopping would result in poorly
selected moves. On the other hand, when the maximum
number of samples is sufficiently high, the VOI of
increasing the maximum number of samples in a future
state is low.

Figure 5: Winning rate of the VOI-aware policy in
Go as a function of the number of samples, fixing cost
$c = 10^{-6}$.

Note that if reuse of samples is disallowed in both
Pachi and our VOI-based scheme, the win rate of the
VOI-based scheme is even higher than shown in Figure [5].
This is as expected, as this setting (which is somewhat
unfair to Pachi) is closer to meeting the assumptions
underlying the selection MDP.

7 Conclusion 

The selection problem has numerous applications. 
This paper formalized the problem as a belief-state 
MDP and proved some important properties of the 
resulting formalism. An application of the selection 
problem to control of sampling was examined, and the 
insights provided by properties of the MDP led to ap- 
proximate solutions that improve the state of the art. 
This was shown in empirical evaluation both in "flat" 
selection and when extending the methods to game- 
tree search for the game of Go. 

The methods proposed in the paper open up several
new research directions. The first is a better
approximate solution of the MDP, which should lead to even
better flat sampling algorithms for selection. A more
ambitious goal is extending the formalism to trees — 
in particular, achieving better sampling at non-root 
nodes, for which the purpose of sampling differs from 
that at the root. 